Efficient Queries of Stand-off Annotations for Natural Language Processing on Electronic Medical Records
Abstract
In natural language processing, stand-off annotation uses the starting and ending positions of an annotation to anchor it to the text and stores the annotation content separately from the text. We address the fundamental problem of efficiently storing stand-off annotations when applying natural language processing to narrative clinical notes in electronic medical records (EMRs), and of efficiently retrieving those annotations that satisfy position constraints. Efficient storage and retrieval of stand-off annotations can facilitate tasks such as mapping unstructured text to electronic medical record ontologies. We first formulate this problem as the interval query problem, for which the optimal query and update time is in general logarithmic. We next perform a tight time complexity analysis of the basic interval tree query algorithm and show that it is nonoptimal when applied to a collection of 13 query types from Allen's interval algebra. We then study two closely related state-of-the-art interval query algorithms and propose query reformulations and augmentations to the second algorithm. Our proposed algorithm achieves logarithmic stabbing-max query time and solves the stabbing-interval query tasks on all of Allen's relations in logarithmic time, attaining the theoretical lower bound. Update time remains logarithmic and the space requirement remains linear. We also discuss interval management in external memory models and higher dimensions.
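To make the terms in the abstract concrete, here is a minimal Python sketch of stand-off annotations and a position-constrained query over Allen's 13 relations. It is an illustration under stated assumptions, not the algorithm the paper proposes: the names `Annotation`, `allen_relation`, and `query` are hypothetical, offsets are assumed to be half-open character ranges with `start < end`, the sample sentence is invented, and the query is a linear-scan baseline rather than the logarithmic-time interval structure described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """A stand-off annotation: it stores offsets into the text, not the text itself."""
    start: int  # inclusive character offset (assumed convention)
    end: int    # exclusive character offset (assumed convention)
    label: str


def allen_relation(a_start, a_end, b_start, b_end):
    """Name the Allen relation of interval a relative to interval b.

    Assumes both intervals are non-degenerate (start < end).
    """
    if a_end < b_start:
        return "before"
    if b_end < a_start:
        return "after"
    if a_end == b_start:
        return "meets"
    if b_end == a_start:
        return "met_by"
    # From here on the two intervals share at least one interior point.
    if a_start == b_start and a_end == b_end:
        return "equals"
    if a_start == b_start:
        return "starts" if a_end < b_end else "started_by"
    if a_end == b_end:
        return "finishes" if a_start > b_start else "finished_by"
    if b_start < a_start and a_end < b_end:
        return "during"
    if a_start < b_start and b_end < a_end:
        return "contains"
    return "overlaps" if a_start < b_start else "overlapped_by"


def query(annotations, relation, q_start, q_end):
    """Linear-scan baseline: annotations standing in `relation` to the query span."""
    return [
        a for a in annotations
        if allen_relation(a.start, a.end, q_start, q_end) == relation
    ]


# Invented example note; the annotations live apart from the text.
text = "Patient denies chest pain."
annotations = [
    Annotation(0, 7, "token"),     # "Patient"
    Annotation(8, 14, "negation"), # "denies"
    Annotation(15, 20, "token"),   # "chest"
    Annotation(21, 25, "token"),   # "pain"
    Annotation(15, 25, "symptom"), # "chest pain"
]

# Annotations that start together with the "chest pain" span but end earlier.
starting = query(annotations, "starts", 15, 25)
```

Each call to `query` inspects every annotation, so it costs linear time per query; the abstract's contribution is answering the same relation queries in logarithmic time with linear space.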
Related resources
Genetics Identification of Genomic Predictors of Atrioventricular Conduction Using Electronic Medical Records as a Tool for Genome Science
Background—Recent genome-wide association studies in which selected community populations are used have identified genomic signals in SCN10A influencing PR duration. The extent to which this can be demonstrated in cohorts derived from electronic medical records is unknown. Methods and Results—We performed a genome-wide association study on 2334 European American patients with normal ECGs withou...
Annotating progressive aspect constructions in the spoken section of the British National Corpus
We present a set of stand-off annotations for the ninety thousand sentences in the spoken section of the British National Corpus (BNC) which feature a progressive aspect verb group. These annotations may be matched to the original BNC text using the supplied document and sentence identifiers. The annotated features mostly relate to linguistic form: subject type, subject person and number, form ...
Anaphoric relations in the clinical narrative: corpus creation
OBJECTIVE The long-term goal of this work is the automated discovery of anaphoric relations from the clinical narrative. The creation of a gold standard set from a cross-institutional corpus of clinical notes and high-level characteristics of that gold standard are described. METHODS A standard methodology for annotation guideline development, gold standard annotations, and inter-annotator ag...
Extracting Concepts Related to Homelessness from the Free Text of VA Electronic Medical Records
Mining the free text of electronic medical records (EMR) using natural language processing (NLP) is an effective method of extracting information not always captured in administrative data. We sought to determine if concepts related to homelessness, a non-medical condition, were amenable to extraction from the EMR of Veterans Affairs (VA) medical records. As there were no off-the-shelf products...
Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications
We aim to build and evaluate an open-source natural language processing system for information extraction from electronic medical record clinical free-text. We describe and evaluate our system, the clinical Text Analysis and Knowledge Extraction System (cTAKES), released open-source at http://www.ohnlp.org. The cTAKES builds on existing open-source technologies-the Unstructured Information Mana...